
    Wheels within Wheels: Making Fault Management Cost-Effective

    Local design and optimization of the components of a fault-management system results in sub-optimal decisions: the target system will likely fail to meet its objectives (under-perform) or cost too much if conditions, objectives, or constraints change. We can fix this by applying a nested management system to the fault-management system itself. We believe that doing so will produce a more resilient, self-aware system that can operate effectively across a wider range of conditions and provide better behavior at closer-to-optimal cost. This document summarizes the results of Working Group 7, "Cost-Effective Fault Management", at the Dagstuhl Seminar 09201 "Self-Healing and Self-Adaptive Systems" (organized by A. Andrzejak, K. Geihs, O. Shehory and J. Wilkes). The seminar was held from May 10th to May 15th, 2009, at Schloss Dagstuhl - Leibniz Center for Informatics.

    Modeling Event-driven Time Series with Generalized Hidden Semi-Markov Models

    This report introduces a new model for event-driven temporal sequence processing: Generalized Hidden Semi-Markov Models (GHSMMs). GHSMMs are an extension of hidden Markov models to continuous time that builds on turning the stochastic process of hidden state traversals into a semi-Markov process. A large variety of probability distributions can be used to specify transition durations. It is shown how GHSMMs can be used to address the principal problems of temporal sequence processing: sequence generation, sequence recognition and sequence prediction. Additionally, an algorithm is described for determining the parameters of GHSMMs from a set of training data: the Baum-Welch algorithm is extended by an embedded expectation-maximization algorithm. Under some conditions the procedure can be simplified to the estimation of distribution moments. A proof of convergence and a complexity assessment are provided.
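    The distinguishing feature described in the abstract is that each hidden-state transition carries its own duration distribution. A minimal sketch of such a semi-Markov process (state names, transition probabilities, and mean durations are all illustrative, and exponential durations stand in for the arbitrary distributions GHSMMs permit):

```python
import random

# Illustrative two-state semi-Markov process: discrete transition
# probabilities plus a per-transition duration distribution.
TRANSITIONS = {
    "ok":       [("ok", 0.9), ("degraded", 0.1)],
    "degraded": [("ok", 0.6), ("degraded", 0.4)],
}
# Mean sojourn time per transition (exponential here for simplicity;
# GHSMMs allow a wide variety of duration distributions).
MEAN_DURATION = {("ok", "ok"): 10.0, ("ok", "degraded"): 2.0,
                 ("degraded", "ok"): 5.0, ("degraded", "degraded"): 1.0}

def sample_path(start, horizon, rng=None):
    """Generate (time, state) events until the time horizon is exceeded."""
    rng = rng or random.Random(0)
    t, state, path = 0.0, start, [(0.0, start)]
    while t < horizon:
        succs = TRANSITIONS[state]
        nxt = rng.choices([s for s, _ in succs], [p for _, p in succs])[0]
        t += rng.expovariate(1.0 / MEAN_DURATION[(state, nxt)])
        state = nxt
        path.append((t, state))
    return path

events = sample_path("ok", horizon=50.0)
```

    The continuous, transition-specific sojourn times are what let such a model represent event-driven log timestamps directly instead of forcing them onto a fixed discrete time grid.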

    Reliability Modeling of Proactive Fault Handling

    Research on dependable computing is undergoing a shift from traditional fault tolerance towards techniques that handle faults proactively. These techniques comprise two parts: (a) prediction of failures and (b) actions that are performed in case of an upcoming failure. This work provides the first reliability model that incorporates both correct and false predictions as well as both types of actions: failure prevention and recovery preparation. Closed-form solutions for availability, reliability and hazard rate are provided.
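    To illustrate the kind of trade-off such a model captures, here is a deliberately simplified availability sketch (not the paper's closed form): a fraction of failures is averted by correct predictions, while false predictions add overhead downtime. All parameter names and the linear-downtime approximation are assumptions for illustration only.

```python
def availability(mttf, mttr, recall=0.0, fp_rate=0.0, fp_cost=0.0,
                 prevention_success=1.0):
    """Toy steady-state availability under proactive fault handling.

    mttf, mttr         -- mean time to failure / to repair (same time unit)
    recall             -- fraction of failures that are correctly predicted
    prevention_success -- fraction of predicted failures actually averted
    fp_rate, fp_cost   -- false predictions per time unit and downtime each

    Approximation: availability = 1 - expected downtime per time unit.
    """
    effective_failure_rate = (1.0 - recall * prevention_success) / mttf
    downtime_per_unit = effective_failure_rate * mttr + fp_rate * fp_cost
    return 1.0 - downtime_per_unit
```

    Even this toy version shows why both correct and false predictions must appear in the model: a predictor with high recall but a high false-positive rate can lower availability rather than raise it.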

    Advanced Failure Prediction in Complex Software Systems

    The availability of software systems can be increased by preventive measures which are triggered by failure prediction mechanisms. In this paper we present and evaluate two non-parametric techniques which model and predict the occurrence of failures as a function of discrete and continuous measurements of system variables. We employ two modelling approaches: an extended Markov chain model and a function approximation technique utilising universal basis functions (UBF). The presented modelling methods are data-driven rather than analytical and can handle large numbers of variables and large amounts of data. Both modelling techniques have been applied to real data from a commercial telecommunication platform. The data include event-based log files and continuously measured system states. Results are presented in terms of precision, recall, F-measure and cumulative cost. We compare our results to standard techniques such as linear ARMA models. Our findings suggest significantly improved forecasting performance compared to alternative approaches. By using the presented modelling techniques, software availability may be improved by an order of magnitude.
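    The precision, recall, and F-measure used to report these results are the standard quantities computed from prediction counts; a minimal sketch (function and variable names are illustrative):

```python
def prediction_metrics(tp, fp, fn):
    """Prediction quality from counts of true positives (failures correctly
    predicted), false positives (false alarms), and false negatives
    (missed failures)."""
    precision = tp / (tp + fp)          # predicted failures that were real
    recall = tp / (tp + fn)             # real failures that were predicted
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# e.g. 8 correct predictions, 2 false alarms, 4 missed failures:
p, r, f = prediction_metrics(8, 2, 4)   # p = 0.8, r = 2/3
```

    The cumulative-cost metric mentioned in the abstract goes beyond these counts by weighting each false alarm and missed failure with its operational cost.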

    Error Log Processing for Accurate Failure Prediction

    Error logs are a fruitful source of information both for diagnosis and for proactive fault handling; however, elaborate data preparation is necessary to filter out the valuable pieces of information. In addition to the use of well-known techniques, we propose three algorithms: (a) assignment of error IDs to error messages based on Levenshtein's edit distance, (b) a clustering approach to group similar error sequences, and (c) a statistical noise-filtering algorithm. In experiments using data from a commercial telecommunication system, we show that data preparation is an important step towards accurate error-based online failure prediction.
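    Step (a) can be sketched as follows: compute the edit distance between an incoming message and known message templates, and reuse an existing ID when the normalized distance is small. The helper `assign_error_id`, the threshold value, and the normalization are illustrative assumptions, not the paper's algorithm.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_error_id(message, known, threshold=0.3):
    """Hypothetical helper: reuse the ID of the closest known template if the
    length-normalized distance is within `threshold`, else mint a new ID."""
    best_id, best_d = None, 1.0
    for eid, template in known.items():
        d = levenshtein(message, template) / max(len(message), len(template))
        if d < best_d:
            best_id, best_d = eid, d
    if best_id is not None and best_d <= threshold:
        return best_id
    new_id = max(known, default=0) + 1
    known[new_id] = message
    return new_id
```

    Grouping by edit distance lets messages that differ only in variable parts (device names, counters, addresses) map to the same error ID before sequence clustering and noise filtering are applied.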

    Predicting failures of computer systems: a case study for a telecommunication system

    The goal of online failure prediction is to forecast imminent failures while the system is running. This paper compares Similar Events Prediction (SEP) with two other well-known techniques for online failure prediction: a straightforward method based on a reliability model, and the Dispersion Frame Technique (DFT). SEP is based on recognition of failure-prone patterns utilizing a semi-Markov chain in combination with clustering. We applied the approaches to real data from a commercial telecommunication system. Results are presented in terms of precision, recall, F-measure and accumulated runtime cost. The results suggest significantly improved forecasting performance.